PARADOCS: A Language Independent Go-Between for Mating Parallel Documents
Authors
Abstract
Parallel corpora are the bread and butter of many machine translation technologies, so considerable effort is regularly spent on acquiring new ones. This task often involves cumbersome manual inspection, and it is difficult to devise a single strategy that fits every need. We therefore developed PARADOCS, a system that aims to perform this task automatically. Our solution exploits numerical entities in documents in order to pair them. Its main components are a classifier trained to recognize parallel text and an information retrieval engine that controls the search space of candidate pairs. We tested PARADOCS on a number of tasks involving numerous language pairs and report good results.
KEYWORDS: parallel corpora, information retrieval, machine translation.
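The idea of pairing documents through their numerical entities can be illustrated with a minimal sketch. This is not the PARADOCS implementation: the function names (`numeric_profile`, `overlap_score`, `pair_documents`), the Dice overlap measure, and the 0.5 threshold are all assumptions chosen for illustration. An exhaustive comparison stands in for the information retrieval engine, and a simple threshold stands in for the trained classifier.

```python
import re
from collections import Counter

# Hypothetical pattern: digit runs, optionally grouped with "." or ",".
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def numeric_profile(text):
    """Count the numeric tokens in a text.

    Separators are stripped so that "1,250" and "1.250" compare equal.
    This is an illustrative simplification: it also conflates "3.5" and "35".
    """
    return Counter(re.sub(r"[.,]", "", tok) for tok in NUMBER_RE.findall(text))


def overlap_score(a, b):
    """Dice coefficient between two multisets of numeric tokens."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum((a & b).values())  # multiset intersection
    return 2.0 * shared / total


def pair_documents(source_docs, target_docs, threshold=0.5):
    """Pair each source document with its best-scoring target document.

    Every source is compared against every target here; a real system
    would restrict the candidates first. The threshold is an assumed
    stand-in for a learned decision about whether a pair is parallel.
    """
    target_profiles = {name: numeric_profile(t) for name, t in target_docs.items()}
    pairs = {}
    for s_name, s_text in source_docs.items():
        s_prof = numeric_profile(s_text)
        best_name, best_score = None, 0.0
        for t_name, t_prof in target_profiles.items():
            score = overlap_score(s_prof, t_prof)
            if score > best_score:
                best_name, best_score = t_name, score
        if best_name is not None and best_score >= threshold:
            pairs[s_name] = (best_name, best_score)
    return pairs
```

Numbers are a convenient signal because they tend to survive translation unchanged, so the sketch needs no bilingual lexicon.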
Similar resources
Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia
While several recent works dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present PARADOCS, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of contr...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been carried out on clustering English documents. The question now arises: can those results be extended to other languages? Since the goal of document clustering is grouping of docum...
English - Oromo Machine Translation: An Experiment Using a Statistical Approach
This paper deals with the translation of English documents into Oromo using statistical methods. Whereas English is the lingua franca of online information, Oromo, despite its relatively wide distribution within Ethiopia and neighbouring countries like Kenya and Somalia, is one of the most resource-scarce languages. The paper has two main goals: one is to test how far we can go with the available limit...
Cross-Lingual Topical Relevance Models
Cross-lingual relevance modelling (CLRLM) is a state-of-the-art technique for cross-lingual information retrieval (CLIR) which integrates query term disambiguation and expansion in a unified framework, to directly estimate a model of relevant documents in the target language starting with a query in the source language. However, CLRLM involves integrating a translation model either on the docum...
On the Link between Identity Processing and Learning Styles among Young Language learners
The present study attempted to investigate the probable relationship between Iranian young language learners’ identity processing styles and their learning styles. To this end, 29 advanced learners (23 females and 6 males) were randomly selected from an English language institute. Twenty-nine advanced young language learners were chosen randomly out of all advanced young language learners in t...
Journal title:
- TAL
Volume 51, Issue
Pages -
Publication date 2010